Exploring White Wines by David Manasco

For this project I will be exploring the white wines dataset. It contains 4,898 records, each with 11 quantitative chemical variables and a quality variable that records the expert opinion of the wine. I plan to explore what factors, if any, contribute to the quality of the wine.

## Number of observations in dataset:
## 4898
## Number of variables in dataset:
## 13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Univariate Plots Section

The chart shows that wine quality is roughly normally distributed. The most common rating is 6, followed by 5 and then 7.

From the charts above we can see that most of the variables follow a roughly normal distribution, except for residual sugar, which is strongly right-skewed. I cleaned the dataset by removing all observations above the 95% quantile of residual sugar. With the cleaned data, I applied three different x-axis transformations to understand the distribution: first log base 10, then log base 2, then square root.

After transformation, the residual sugar histogram is closer to a normal distribution. One interesting thing is that the histogram has a bimodal look, with more than one peak.
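The trimming and the three transformations can be sketched in base R as follows. This is only an illustration under stated assumptions: `residual_sugar` is a simulated right-skewed vector standing in for the real column, and the last line uses the standardized gap between mean and median as a rough skewness indicator, which is not something the original analysis reports.

```r
# Hypothetical stand-in for the residual sugar column: a simulated
# right-skewed positive vector plus one extreme value.
set.seed(42)
residual_sugar <- c(rlnorm(500, meanlog = 1.5, sdlog = 0.8), 65.8)

# Keep only values at or below the 95% quantile.
cutoff  <- quantile(residual_sugar, 0.95)
trimmed <- residual_sugar[residual_sugar <= cutoff]

# Two log transformations and one square-root transformation.
transformed <- data.frame(
  raw   = trimmed,
  log10 = log10(trimmed),
  log2  = log2(trimmed),
  sqrt  = sqrt(trimmed)
)

# Rough skewness indicator per column: (mean - median) / sd.
# Values closer to 0 suggest a more symmetric distribution.
skew_gap <- sapply(transformed, function(x) (mean(x) - median(x)) / sd(x))
print(round(skew_gap, 3))
```

Note that log base 10 and log base 2 differ only by a constant factor, so they change the axis labels but not the shape of the histogram.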

Univariate Analysis

What is the structure of your dataset?

For this dataset there are 4,898 observations with 11 quantitative variables. Those variables are pH, alcohol, fixed acidity, volatile acidity, citric acid, chlorides, free sulfur dioxide, total sulfur dioxide, density, residual sugar, and sulphates. There is also one subjective variable, quality, which gives a way to determine what factors into a better quality rating.

What is/are the main feature(s) of interest in your dataset?

Personally, being a wine drinker, I’d love to determine what factors go into making a better quality wine. I want to find out which variables have a positive effect on quality and which ones have a negative effect on quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

At this point in the investigation process, it is important to investigate the relationships between all variables, since any of them could affect the quality rating. For now I suspect that pH and alcohol will have a big effect on quality.

Did you create any new variables from existing variables in the dataset?

At this point the only variable I created is the bound sulfur dioxide variable. I created it by subtracting the free sulfur dioxide from the total sulfur dioxide. I have only created a basic histogram for it, but I plan on using it during the bivariate analysis.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The only feature with an unusual distribution was residual sugar. Its distribution has two peaks, around 2 and 8, which leads me to believe there are two types of white wines in the dataset: those with a higher sugar content and those with a lower sugar content. I did have to clean up the dataset with regard to residual sugar. There was a major outlier at about 65.8 (the maximum in the summary above), so I kept only the data points below the 95% quantile. After the data was cleaned up, it was easier to see the distribution.
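The bound sulfur dioxide variable described in this section is a simple column difference. A minimal sketch, assuming a data frame named `wines` (a hypothetical name) with the column names shown in the `str()` output; the four rows below are copied from the first values of that output:

```r
# Hypothetical four-row stand-in for the full dataset; the values are the
# first four entries of each column in the str() output above.
wines <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# Bound sulfur dioxide = total sulfur dioxide - free sulfur dioxide.
# The column name bound.sulfur.dioxide is an assumption.
wines$bound.sulfur.dioxide <-
  wines$total.sulfur.dioxide - wines$free.sulfur.dioxide

print(wines)
```

Because free sulfur dioxide is a component of the total, the derived column should never be negative, which makes a quick sanity check on the real data straightforward.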

Bivariate Plots Section

This is the ggpairs plot for all of the variables. From this I decided that I needed the correlations in a different format so I could examine which variables I want to focus on.

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000

From this we see that a few pairs of variables are strongly correlated with each other. The ones that stand out involve residual sugar, density, alcohol, and quality.

Next I created a new data frame that groups the data by quality rating and computes the mean of alcohol, pH, density, and residual sugar for each rating. We can then plot those means to see how they change with the quality of the wine.

We can see a strong correlation between residual sugar and the density of the wine. This makes sense: sugar is denser than water, so the more sugar a wine contains, the denser it should be.
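A correlation matrix of this size is easier to scan when it is flattened into a ranked list of pairs. The sketch below shows one way to do that in base R. The data frame `toy` is simulated purely for illustration (its coefficients are made up and do not reproduce the numbers above), and `ranked` is a hypothetical name.

```r
# Simulated stand-in data; the relationships are invented for illustration.
set.seed(1)
n <- 200
residual.sugar <- runif(n, 1, 20)
alcohol <- 13 - 0.15 * residual.sugar + rnorm(n, sd = 0.8)
density <- 0.990 + 0.0004 * residual.sugar -
  0.001 * (alcohol - 10) + rnorm(n, sd = 0.0005)
quality <- round(pmin(9, pmax(3, 2 + 0.4 * alcohol + rnorm(n))))
toy <- data.frame(residual.sugar, alcohol, density, quality)

# Pearson correlation matrix (the default method of cor()).
corr <- cor(toy)

# Flatten the upper triangle into one row per variable pair,
# then sort by absolute correlation, strongest first.
idx <- which(upper.tri(corr), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked, row.names = FALSE)
```

Sorting by absolute value keeps strong negative relationships, such as the one between density and alcohol, near the top of the list.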

## # A tibble: 7 x 6
##   quality alcohol_mean pH_mean density_mean residualSugar_mean     n
##     <int>        <dbl>   <dbl>        <dbl>              <dbl> <int>
## 1       3        10.3     3.19        0.995               6.39    20
## 2       4        10.2     3.18        0.994               4.63   163
## 3       5         9.81    3.17        0.995               7.33  1457
## 4       6        10.6     3.19        0.994               6.44  2198
## 5       7        11.4     3.21        0.992               5.19   880
## 6       8        11.6     3.22        0.992               5.67   175
## 7       9        12.2     3.31        0.991               4.12     5

Next I reshaped the data set into a long format in order to work with the mean of the four variables by quality. I will use these means to see what trends exist in the data.

From these plots we can see that as the quality of the wine increases, the mean alcohol and mean pH generally increase as well, most clearly from a rating of 5 upward. Conversely, mean density falls as quality goes up, and mean residual sugar tends to be lower for the highest-rated wines, although it does not fall steadily.
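The grouping and reshaping steps can be sketched as follows. The tibble output above suggests the original analysis used dplyr; this sketch swaps in base R `aggregate()` so that it runs without extra packages. The data frame `toy` and the names `by_quality` and `long` are hypothetical, and only two of the four variables are included to keep the example short.

```r
# Hypothetical six-row stand-in for the wine data.
toy <- data.frame(
  quality = c(5, 5, 6, 6, 6, 7),
  alcohol = c(9.5, 9.9, 10.1, 10.5, 11.0, 11.4),
  pH      = c(3.10, 3.20, 3.15, 3.19, 3.26, 3.30)
)

# Mean of each variable per quality rating.
means <- aggregate(cbind(alcohol, pH) ~ quality, data = toy, FUN = mean)

# Number of wines per quality rating.
counts <- aggregate(alcohol ~ quality, data = toy, FUN = length)
names(counts)[2] <- "n"

by_quality <- merge(means, counts, by = "quality")

# Long format: one row per (quality, variable) pair, convenient for
# drawing one panel per variable.
long <- data.frame(
  quality  = rep(by_quality$quality, times = 2),
  variable = rep(c("alcohol", "pH"), each = nrow(by_quality)),
  mean     = c(by_quality$alcohol, by_quality$pH)
)
print(long)
```

Keeping the count `n` next to the means matters when reading the real table, because the extreme ratings (3 and 9) are based on very few wines.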

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

While exploring this data set and looking for relationships, one correlation that stuck out to me is that as the quality rating of the wine increases, so does the mean alcohol content. The same goes for the mean pH: the higher the quality, the higher (less acidic) the pH. The other interesting relationship was that as the density of the wine decreases, the quality increases. This leads me to reason that the lighter the wine is, the better the quality rating it receives.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

One of the more interesting things to me was the relationship between residual sugar and the density of the wine. This relationship stands to reason because sugar is denser than water, so a sweeter wine is a denser wine.

What was the strongest relationship you found?

The strongest relationship I found was between residual sugar and density. With an r value of about 0.84 in the correlation matrix above, this was the most correlated pair. The second most fascinating relationship I found was between alcohol and quality. With an r value of about 0.44, this relationship shows us that the higher quality wines also tend to have a higher alcohol content.

Multivariate Plots Section

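# Bin the numeric quality rating into three ordered groups:
# (-Inf, 5] = low, (5, 7] = medium, (7, Inf) = high
cleanSugarWines$quality <- cut(cleanSugarWines$quality,
                     breaks=c(-Inf, 5, 7, Inf),
                     labels=c("low","medium","high"))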

For the multivariate plots I will first break the data up into quality groups. I decided that a rating of 5 or less is low quality, 6 and 7 are medium quality, and 8 and 9 are high quality. The breakdown we have is 1616 low quality wines, 3052 medium quality wines, and 180 high quality wines.

From this chart we can see the correlations based on the quality of the wines. We see that the high quality wines have a lower residual sugar content than the medium and low quality wines. The correlation between density and residual sugar is slightly higher for the medium quality wines than for the other quality groups.
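To check that the `cut()` breaks produce the intended groups, it helps to apply them to each possible rating. In the sketch below, the per-rating counts are copied from the grouped table in the bivariate section, which covers all 4,898 wines, so the group totals differ from the cleaned-data breakdown quoted in the text. The names `ratings`, `buckets`, and `bucket_counts` are hypothetical.

```r
# Every rating that occurs in the dataset.
ratings <- 3:9

# cut() uses right-closed intervals by default:
# (-Inf, 5], (5, 7], (7, Inf)
buckets <- cut(ratings,
               breaks = c(-Inf, 5, 7, Inf),
               labels = c("low", "medium", "high"))
print(data.frame(rating = ratings, bucket = buckets))

# Wines per rating, copied from the grouped table (full dataset,
# before the residual sugar trimming).
n_per_rating <- c(20, 163, 1457, 2198, 880, 175, 5)

# Total wines per quality group.
bucket_counts <- tapply(n_per_rating, buckets, sum)
print(bucket_counts)
```

On the full dataset this gives 1640 low, 3078 medium, and 180 high quality wines.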
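Comparing correlations across quality groups by eye can be hard, so a per-group coefficient is a useful companion to the chart. A minimal sketch in base R, assuming simulated data: the data frame `toy`, its invented coefficients, and the name `r_by_bucket` are all hypothetical and do not reproduce the real results.

```r
# Simulated stand-in data with a quality group and two numeric variables.
set.seed(7)
n <- 300
toy <- data.frame(
  bucket = factor(sample(c("low", "medium", "high"), n, replace = TRUE),
                  levels = c("low", "medium", "high")),
  residual.sugar = runif(n, 1, 18)
)
toy$density <- 0.991 + 0.0004 * toy$residual.sugar + rnorm(n, sd = 0.001)

# Split the rows by quality group and compute Pearson's r within each group.
r_by_bucket <- sapply(split(toy, toy$bucket),
                      function(d) cor(d$residual.sugar, d$density))
print(round(r_by_bucket, 2))
```

On the real data, the high quality group contains only 180 wines, so its coefficient would be less stable than the other two.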

From this chart we see that there is a negative correlation between residual sugar and alcohol percentage. It is interesting to me that the higher quality wines have a higher alcohol content vs the other qualities. This leads us to reason that as residual sugar increases the alcohol content of the wine goes down.

From this chart we see that there is a slightly stronger correlation between alcohol and pH value for the higher quality wines. It is also interesting to see that as the alcohol increases so does the pH of the wine. I did not know the relationship between alcohol and pH until I explored this data set.

From this chart we see that there is a strong negative correlation between density and alcohol: as the density increases, the alcohol content decreases. The highest rated wines have the highest alcohol content as well as the lowest density.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

One of the relationships I discovered while exploring the multivariate plots was between alcohol and pH: as the pH increases, so does the alcohol. Another interesting correlation was between residual sugar and alcohol. It seems that the less residual sugar a wine has, the higher its alcohol content will be.

Were there any interesting or surprising interactions between features?

The most interesting interaction to me was the relationship between alcohol and density. I had no idea that the more alcohol a wine has, the less dense it will be. The plots also show that the higher the alcohol content and the lower the density, the higher the quality rating tends to be. So a high alcohol, low density wine should be very favorable in terms of rating.


Final Plots and Summary

Plot One

Description One

This chart shows a strong correlation between residual sugar and the density of the wine. This makes sense: sugar is denser than water, so the more sugar a wine contains, the denser it should be. It is important to understand this relationship because it was the strongest correlation in the dataset, at about 0.84.

Plot Two

Description Two

These charts show us that as the quality of the wine increases, so do the mean alcohol and mean pH. Conversely, as quality goes up, the mean density and mean residual sugar tend to go down. These are four important variables to compare against quality. They show us that the experts rated wines more highly when they were less acidic and had a higher alcohol content.

Plot Three

Description Three

This chart shows us that there is a negative correlation between residual sugar and alcohol percentage. It is interesting to me that the higher quality wines have a higher alcohol content vs the other qualities. This leads us to reason that as residual sugar increases the alcohol content of the wine goes down.


Reflection

For this exploratory data analysis, I chose the white wines dataset because I love white wines. It was really interesting to be able to examine over 4800 different white wines along with their chemical breakdowns. It gave me a better understanding of what makes a higher quality wine. Specifically, a higher quality wine will generally have these traits: higher alcohol content, lower residual sugar, lower density, and a higher pH value. To me this is important to know because it can help you buy better quality wines. The higher the quality, the more enjoyable the experience of drinking is.